import os.path
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

# Download and unzip data files (pickle files) from the S3 bucket
data_path = 'data/'
if not os.path.exists(data_path):
    data_zip_url = 'https://s3-us-west-1.amazonaws.com/udacity-selfdrivingcar/traffic-signs-data.zip'
    with urlopen(data_zip_url) as zip_resp:
        with ZipFile(BytesIO(zip_resp.read())) as zfile:
            zfile.extractall(path=data_path)
The pickled data is a dictionary with 4 key/value pairs:

- 'features' is a 4D array containing the raw pixel data of the traffic sign images, shaped (num examples, width, height, channels).
- 'labels' is a 1D array containing the label/class id of each traffic sign. The file signnames.csv contains the id -> name mapping for each id.
- 'sizes' is a list of tuples, (width, height), giving the original width and height of each image.
- 'coords' is a list of tuples, (x1, y1, x2, y2), giving the coordinates of a bounding box around the sign in each image. These coordinates assume the original image; the pickled data contains resized versions (32 by 32) of these images.

From these randomly sampled images, we can see that many of the signs vary in brightness, which can definitely make classification more difficult. We also note that the signs tend to come in one of two shapes, a circle or a triangle, and that they generally use some combination of red, white, and blue. Lastly, many of the signs are nearly identical in color and shape and differ only in the icon within the sign.
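To make the structure concrete, here is a minimal sketch of what loading one of these pickled dictionaries looks like. To keep it self-contained, a tiny synthetic dictionary is pickled in memory instead of the actual files in the archive (assumed to be named along the lines of train.p, valid.p, and test.p):

```python
import pickle
import numpy as np
from io import BytesIO

# Tiny synthetic stand-in with the same four keys as the real pickled files.
fake = {
    'features': np.zeros((5, 32, 32, 3), dtype=np.uint8),  # 5 example images
    'labels': np.array([0, 1, 1, 2, 0]),
    'sizes': [(120, 118)] * 5,                # original (width, height)
    'coords': [(6, 5, 110, 112)] * 5,         # bounding boxes in original coords
}
buf = BytesIO()
pickle.dump(fake, buf)
buf.seek(0)

# With the real data this would be: with open('data/train.p', 'rb') as f: ...
data = pickle.load(buf)
X, y = data['features'], data['labels']
print(X.shape, y.shape)  # (5, 32, 32, 3) (5,)
```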
We can observe a significant class imbalance: some classes have approximately $2000$ images while eleven classes have fewer than $300$ images.
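The per-class counts behind this observation can be computed directly from the label array. A small sketch with a hypothetical label array (the real y_train spans 43 classes):

```python
import numpy as np

# Hypothetical labels; with the real data this would be y_train.
y = np.array([0, 1, 1, 2, 2, 2, 2])
counts = np.bincount(y)
print(counts)   # counts per class id: [1 2 4]

# Classes below some example threshold of examples:
rare = np.where(counts < 3)[0]
print(rare)     # [0 1]
```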
Before creating a CNN, we try a Random Forest model on the image statistics we defined above, just to get a feel for the difficulty of this problem.
As we see above, the model doesn't perform well (only about $13\%$ accuracy). This is much better than random guessing, but we can likely do much better than this underfitting model.
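A sketch of this kind of baseline is below. The notebook's exact image statistics aren't reproduced here; per-channel means and standard deviations stand in as assumed features, and the data is synthetic, so the accuracy numbers are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_imgs = rng.integers(0, 256, size=(200, 32, 32, 3), dtype=np.uint8)  # fake images
y = rng.integers(0, 5, size=200)                                      # fake labels

# Crude per-image statistics: mean and std of each color channel.
feats = np.concatenate([
    X_imgs.mean(axis=(1, 2)),
    X_imgs.std(axis=(1, 2)),
], axis=1)  # shape (200, 6)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(feats[:150], y[:150])
acc = clf.score(feats[150:], y[150:])
```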
We start our data preprocessing by one-hot encoding the targets. This is necessary for training since we don't want the model to learn from the numerical labeling ("label encoding"). For fun, we also observe what happens when we train on these label-encoded targets.
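One-hot encoding can be done with keras.utils.to_categorical or, equivalently, with a NumPy identity-matrix lookup, as in this small sketch:

```python
import numpy as np

# Label-encoded targets (class ids); 43 classes in the real data.
y = np.array([0, 2, 1, 2])
n_classes = 3
y_onehot = np.eye(n_classes, dtype=np.float32)[y]
print(y_onehot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```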
We also normalize the pixels in the images following the suggestion of subtracting $128$ and then dividing the result by $128$. This effectively transforms the pixels' range from $(0, 255)$ to approximately $(-1, 1)$, which is easier for training a neural network.
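The normalization itself is one line; note that the resulting range is exactly $[-1, 127/128]$, i.e. approximately $[-1, 1)$:

```python
import numpy as np

# Pixel extremes and midpoint to show the mapping.
X = np.array([[0, 128, 255]], dtype=np.uint8)
X_norm = (X.astype(np.float32) - 128.0) / 128.0
print(X_norm)  # [[-1.  0.  0.9921875]]
```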
We actually skip converting the images to grayscale since there is a lot of information provided by the color of the signs (blue vs. red). While it's arguable that color is unnecessary for recognition, the color feature is likely helpful for the model to train on.
We've actually designed a few model architectures going from simple to more complex (deeper and more parameters). We base each of the three different models on convolutional layers following a pattern of two convolutions before a max pool layer. The convolutions increase in complexity by adding more filters as we progress into the deeper layers. We apply batch normalization throughout to help speed up training and provide some regularization.
Note we also define a DefaultConv2D function to create convolutional layers using the ReLU activation function and the He normal kernel initializer, since He initialization has been shown to work well with ReLU.
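This kind of default-argument factory is typically built with functools.partial. The sketch below uses a plain stand-in function so it runs without TensorFlow; in the notebook the same pattern would wrap keras.layers.Conv2D:

```python
from functools import partial

# Stand-in for keras.layers.Conv2D, used only to show the partial() pattern.
def conv2d(filters, kernel_size=3, activation=None, kernel_initializer=None):
    return {'filters': filters, 'kernel_size': kernel_size,
            'activation': activation, 'init': kernel_initializer}

# In the notebook this would be:
# DefaultConv2D = partial(keras.layers.Conv2D, kernel_size=3, padding='same',
#                         activation='relu', kernel_initializer='he_normal')
DefaultConv2D = partial(conv2d, activation='relu',
                        kernel_initializer='he_normal')
layer = DefaultConv2D(32)
print(layer['activation'], layer['init'])  # relu he_normal
```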
Below are the three architectures used (ordered by increasing complexity):
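As one hypothetical sketch of the simplest of these (basic_cnn-32cov-64cov-dense), following the pattern described above (two convolutions before each max pool, batch normalization throughout); the actual layer counts and widths in the notebook may differ:

```python
from tensorflow import keras

# Illustrative sketch only; layer widths and the dense head are assumptions.
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    keras.layers.Conv2D(32, 3, padding='same', activation='relu',
                        kernel_initializer='he_normal'),
    keras.layers.Conv2D(32, 3, padding='same', activation='relu',
                        kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, padding='same', activation='relu',
                        kernel_initializer='he_normal'),
    keras.layers.Conv2D(64, 3, padding='same', activation='relu',
                        kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(43, activation='softmax'),  # 43 sign classes
])
model.compile(optimizer=keras.optimizers.Nadam(),
              loss='categorical_crossentropy', metrics=['accuracy'])
```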
Now we come to actually training the models and evaluating their performances. We will train and evaluate three different architectures, so we'll be testing all three with the test set. Note that we also observe a training run without one-hot encoding but will not consider its results valid for our final test.
For the other models, we train each CNN with a batch size of $128$ to help speed up convergence. We also set each to train for $150$ epochs, though we implement early stopping: after $8$ epochs with no improvement in the validation accuracy greater than $0.01\%$, training will stop. This ensures we don't accidentally overfit drastically to our data. We generally observe the models stop training after about $50$ epochs.
Lastly, we include a checkpoint to save only the best weights of each model (based on the validation loss). This allows us to go back to a better model at the end if the early stopping callback stops the training. We also use Nesterov momentum with the Adam optimizer (i.e., Nadam) since Adam tends to do quite well in helping models converge, especially when paired with (Nesterov) momentum.
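In Keras, the two callbacks described above could look like the following sketch; the monitored quantities and min_delta translation of "$0.01\%$" are assumptions about the notebook's settings, and the checkpoint filename is hypothetical:

```python
from tensorflow import keras

# Stop after 8 epochs with no val-accuracy improvement greater than 0.01%.
early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_accuracy', min_delta=0.0001, patience=8)

# Keep only the weights with the best validation loss seen so far.
checkpoint = keras.callbacks.ModelCheckpoint(
    'best_model.h5', monitor='val_loss', save_best_only=True)
```

These would then be passed to model.fit via callbacks=[early_stopping, checkpoint].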
Since evaluation is very similar for each model, we created a short function to plot the accuracy and loss for both the training and validation using the model's history.
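A sketch of such a helper is below. It accepts either a Keras History object or its plain .history dict, so it can be tried without training a model first (the real function in the notebook may differ in layout details):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot training/validation accuracy and loss side by side."""
    hist = history.history if hasattr(history, 'history') else history
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    for key in ('accuracy', 'val_accuracy'):
        if key in hist:
            ax1.plot(hist[key], label=key)
    ax1.set_title('Accuracy'); ax1.set_xlabel('epoch'); ax1.legend()
    for key in ('loss', 'val_loss'):
        if key in hist:
            ax2.plot(hist[key], label=key)
    ax2.set_title('Loss'); ax2.set_xlabel('epoch'); ax2.legend()
    return fig

# Works with a plain dict standing in for model.history.history:
fig = plot_history({'accuracy': [0.5, 0.8], 'val_accuracy': [0.4, 0.7],
                    'loss': [1.2, 0.6], 'val_loss': [1.4, 0.8]})
```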
Here we observe what happens when no one-hot encoding is done for a relatively simple CNN. The observant reader will see that this network does quite well with little training (fewer than 20 epochs) compared to the other models trained after one-hot encoding later in this notebook. These results seem dubious since the only real change is how the targets have been labeled, so they will not be considered for the project.
We created an encapsulating method to help us select, compile, and train our different models. This allows us to train successive models and evaluate each one immediately after training before automatically moving on to the next model.
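Such a helper might look like the sketch below; the name, signature, and defaults are assumptions made to mirror the settings described in this section (batch size $128$, $150$ epochs):

```python
def train_and_evaluate(build_model, X_train, y_train, X_val, y_val,
                       epochs=150, batch_size=128, callbacks=()):
    """Hypothetical helper: build one model, train it, and return it with
    its training history so the next model can be run immediately after."""
    model = build_model()
    history = model.fit(X_train, y_train,
                        epochs=epochs,
                        batch_size=batch_size,
                        validation_data=(X_val, y_val),
                        callbacks=list(callbacks))
    return model, history
```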
Note that while we train different models, we use the same batch size, number of epochs, and data (training, validation, and testing sets) for all the models.
We observed that our basic_cnn-32cov-64cov-dense model performed the best; it had the smallest validation loss ($0.18$), the best validation accuracy (more than $94\%$), and performed well on the test set ($\sim92.5\%$ accuracy).
Though one of the more complex models performed slightly better than the simplest one trained, we chose the basic model since a simpler model should be less likely to overfit.
Next, we'd like to find new traffic images to test on.
The following images were found via Google Street View near Wattenscheid, Germany. Below, the images are shown with the matching class from signnames.csv and a link to the source image. Overall, the images seem relatively ideal, with decent resolution and very little background. However, the Yield sign is slightly askew, which could reduce the chances of it being classified correctly. We also note that although the background behind the signs doesn't seem particularly different from our training set, it's possible this will affect the classification accuracy on these new images.
We'll rescale the images to 32x32 (RGB) so we can use them with our model.
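A sketch of this preprocessing with Pillow, combining the resize with the same $(x - 128)/128$ normalization used during training (the helper name is hypothetical, and a synthetic image stands in for a downloaded sign photo):

```python
import numpy as np
from PIL import Image

def to_model_input(path_or_img):
    """Resize an RGB image to 32x32 and normalize as during training."""
    img = (path_or_img if isinstance(path_or_img, Image.Image)
           else Image.open(path_or_img))
    img = img.convert('RGB').resize((32, 32))
    arr = np.asarray(img, dtype=np.float32)
    return (arr - 128.0) / 128.0

# Synthetic solid-color image standing in for a downloaded sign photo:
sample = Image.new('RGB', (640, 480), color=(200, 30, 30))
x = to_model_input(sample)
print(x.shape)  # (32, 32, 3)
```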
Below we plot the original image with its class name and then a random image from the training set (with that class name) for each of the top five predictions by the model as well as the associated certainty.
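The top five predictions can be pulled from the model's softmax output by sorting the class probabilities, as in this small sketch with a hypothetical five-class output (the real output has 43 classes):

```python
import numpy as np

# Hypothetical softmax output for one image over 5 classes.
probs = np.array([0.05, 0.6, 0.1, 0.2, 0.03])

# Indices of the classes sorted by descending probability.
top5 = np.argsort(probs)[::-1][:5]
print(top5)                 # [1 3 2 0 4]
print(probs[top5])          # certainties in the same order
```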
As we can see, only the Speed limit (80km/h) image was correctly identified by its highest-certainty guess. Besides that, only the Yield image had the correct class within its top five guesses.
In general, the model tends to be relatively uncertain about its guesses, with the exception of the No vehicles image, where the model guessed incorrectly with almost $93\%$ certainty. This general uncertainty shows that the model is having a difficult time in determining the signs. We also note that the model's predictions tend toward "reasonable" guesses, such as picking signs of the same shape (round vs. angular) or similar colors (blue vs. red & white).
Overall, the performance of the basic_cnn-32cov-64cov-dense model was not as expected. This might mean that accuracy was not the right metric for judging initial performance. Better metrics, in hindsight, would be precision and recall, perhaps with a confusion matrix to visualize performance per class. We likely need a more complex model to overcome this underfitting; this might mean going with a more complicated model even if its accuracy is lower, since performance per class could give a better idea of the true model performance relative to real-world (desired) performance.